DMS: Create deployment and version during bundle deploy#5386
DMS: Create deployment and version during bundle deploy#5386shreyas-goenka wants to merge 1 commit into
bundle deploy#5386Conversation
Integration test reportCommit: caeb738
23 interesting tests: 13 SKIP, 7 KNOWN, 2 flaky, 1 RECOVERED
Top 4 slowest tests (at least 2 minutes):
|
98dc0c7 to
4a4382f
Compare
4a4382f to
de9adfd
Compare
de9adfd to
bb16fcd
Compare
bb16fcd to
bdc7ba2
Compare
bdc7ba2 to
1e8ba45
Compare
ff910cd to
ee860e8
Compare
1e8ba45 to
76bf017
Compare
|
|
||
| // The server validates that versionID equals last_version_id + 1 and returns | ||
| // ABORTED otherwise (e.g. a concurrent deploy already created this version). | ||
| version, versionErr := svc.CreateVersion(ctx, sdkbundle.CreateVersionRequest{ |
There was a problem hiding this comment.
this will not work well when the plan is serialized and potentially outdated (because we do not use serial here)
Will be fixed in a followup.
|
We can remove the traditional file based lock in a followup. Not necessary for now / preview. |
| return fmt.Errorf("failed to parse version ID %q: %w", versionID, err) | ||
| } | ||
| r.versionNum = versionNum | ||
| r.stopHeartbeat = startHeartbeat(ctx, r.svc, r.deploymentID, versionID) |
There was a problem hiding this comment.
We call CreateVersion twice: in deploy and destroy and seem to start heatbeat twice, shall we have only 1 instance of heartbeat?
There was a problem hiding this comment.
Can you clarify? Those are independent code paths and both need a heartbeat right?
There was a problem hiding this comment.
Ah right, my bad, indeed a separate processes
607bdd0 to
98f4444
Compare
| bundle.ApplyContext(ctx, b, lock.Release(lock.GoalDeploy)) | ||
| }() | ||
| if err := recorder.CreateVersion(ctx); err != nil { |
There was a problem hiding this comment.
When we remove file lock later on, how do we ensure that there's no race condition creating multiple versions? We don't seem to do any synchronisation / locking here now, is it part of follow up?
There was a problem hiding this comment.
The CreateVersion implicitly has locking semantics. Only one client can have a "live" version that is in progress at a time.
There was a problem hiding this comment.
The server-side version counter is the synchronization: CreateVersion only succeeds when the requested version is last_version_id + 1, otherwise it returns ABORTED (409). So even without the file lock, two concurrent deploys racing to create the next version would have one win and the other get ABORTED. Surfacing that as a clean user-facing error (retry/serial handling) is the follow-up I noted.
There was a problem hiding this comment.
Got it, make sense. Could you add an acceptance test for it though to make sure this behaviour is recorded?
98f4444 to
87c99e8
Compare
87c99e8 to
9a7feb1
Compare
| if db.Data.Lineage == "" { | ||
| db.Data.Lineage = uuid.New().String() | ||
| } | ||
| return db.Data.Lineage |
There was a problem hiding this comment.
This is initialized in Open() below.
There was a problem hiding this comment.
For direct yes. In case of DMS this is initialized at pla n time here instead. I did not want to touch direct deployment code paths.
There was a problem hiding this comment.
We now have a common function to initilaize the lineage. (GetOrInitializeLineage). I confirmed that this is stored in WAL.
9a7feb1 to
35f5a64
Compare
| if parseErr != nil { | ||
| return "", fmt.Errorf("failed to parse last_version_id %q: %w", dep.LastVersionId, parseErr) | ||
| } | ||
| versionID = strconv.FormatInt(lastVersion+1, 10) |
There was a problem hiding this comment.
We'll need to update this logic to also consider the existing serial numbers for migration scenarios. That'll be a followup.
| // Set up DMS recording of this destroy as a version. The version is not | ||
| // created until the destroy is approved (below), so a cancelled destroy | ||
| // records nothing; the deferred CompleteVersion is a no-op until then. | ||
| recorder := newDeploymentRecorder(ctx, b, engine, dms.VersionTypeDestroy) |
There was a problem hiding this comment.
Serial was already fixed when we created the plan, we should read it from there, not from the backend.
For clarity, we should also avoid storing serial/lineage on recorder object and just pass it to CreateVersion() directly from the plan, so that we have one source of truth for this info.
There was a problem hiding this comment.
Good callout, done. We use serial from plan. In a followup I'll add support for storing and reading previous_version_id from plan as well since that requires a SDK version updatE (to get previous_version_id)
| // Record the DMS version now that the plan is approved (a cancelled deploy | ||
| // records nothing). The deployment lineage and the version's serial both | ||
| // come from the plan; CompleteVersion below finalizes this same version. | ||
| if err := recorder.CreateVersion(ctx, plan.Lineage, plan.Serial); err != nil { |
There was a problem hiding this comment.
in a followup - we'll also store and set previous serial in plan - to ensure serializability.
Records each approved deploy/destroy as a version with the Deployment Metadata Service (DMS), gated by experimental.record_deployment_history and the direct engine. The version is created only after the plan is approved — a cancelled or declined deploy/destroy records nothing, so there are no empty/abandoned versions for operations that never ran. - libs/dms: Recorder with CreateVersion / CompleteVersion. The deployment ID is the state lineage (from GetOrInitLineage), so a bundle deployment maps one-to-one to a DMS deployment record. GetDeployment first, CreateDeployment only when missing, then create the next version; heartbeat keeps the version's lease alive; CompleteVersion records success/failure and, for destroy, deletes the deployment record on success. Independent of bundle/lock. - phases: newDeploymentRecorder builds the recorder from the bundle (nil unless the flag is set and the engine is direct); deploy/destroy create the version inside the approved branch (after UpgradeToWrite, so the lineage is already in the WAL) and complete it in the deferred lock release. - libs/testserver: in-memory DMS handlers under /api/2.0/bundle/... - acceptance/bundle/dms: deploy/redeploy/destroy record versions and hold the file lock; redeploy after deleting .databricks recovers the lineage from remote state; enabling the flag after a plain deploy creates a new deployment; a declined destroy records no version and does not delete the deployment. Co-authored-by: Shreyas Goenka <shreyas.goenka@databricks.com>
| // Package dms records bundle deployments as versions with the Deployment | ||
| // Metadata Service (DMS). | ||
| // | ||
| // It is intentionally independent of the deployment lock: a Recorder does not |
There was a problem hiding this comment.
in a followup we can deprecate the file based lock. Out of scope for now.
We need to track deployment state in the server. This PR just creates the scaffolding version and deployment when a user approves the plan to apply it.
Note: because lineage is persisted in WAL already we minimize the risk of orphaned deployments (first create request fails) becase we always read lineage from local WAL during replay.